You are working for a telecommunications company on their customer analytics team. Your manager wants you to analyze customer data and provide insights for retaining existing customers. Retention is a large part of customer relationship management: the cost of attracting a new customer is much higher than the cost of retaining an existing one. As part of the class project, your goal is to analyze customer data and identify customers who are likely to leave, so that you can offer them incentives to stay. The data is spread over two .rda files, Model_Building_Data and Implementation_Data. Model_Building_Data, which contains both current customers and customers who have left, should be used to build the model. Its structure is previewed below.
# Load the dataset
load("Model_Building_Data.rda")
# Load required libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Use glimpse to view variable types and preview values
glimpse(Model_Building_Data)
## Rows: 6,499
## Columns: 17
## $ Gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ Tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No phone service, No, No, No phone service, No, Yes, …
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No …
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Status <chr> "Current", "Current", "Left", "Current", "Left", "Lef…
We are converting Status and SeniorCitizen
to factors because both are categorical variables — Status
is the target label for classification, and SeniorCitizen
is a binary (0/1) feature that should be treated as a category, not a
numeric scale.
# Convert the target variable 'Status' from character to factor
# This allows classification models to treat it as a category
Model_Building_Data$Status <- as.factor(Model_Building_Data$Status)
# Convert 'SeniorCitizen' from integer (0/1) to factor
# 0 = Not a senior, 1 = Senior — it's a binary category, not a numeric trend
# Treating it as a factor helps models understand it's a yes/no type variable
Model_Building_Data$SeniorCitizen <- as.factor(Model_Building_Data$SeniorCitizen)
# Check the updated structure of the dataset after conversions
glimpse(Model_Building_Data)
## Rows: 6,499
## Columns: 17
## $ Gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ Tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No phone service, No, No, No phone service, No, Yes, …
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No …
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Status <fct> Current, Current, Left, Current, Left, Left, Current,…
# Check total missing values column-wise
# This helps identify which variables (if any) have NA values
colSums(is.na(Model_Building_Data))
## Gender SeniorCitizen Partner Dependents
## 0 0 0 0
## Tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity DeviceProtection StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Status
## 0
# Remove rows with missing values (11 rows have missing TotalCharges)
# This is a safe choice since only a small number of rows are affected
Model_Building_Data <- na.omit(Model_Building_Data)
# Re-check to confirm missing values are gone
colSums(is.na(Model_Building_Data))
## Gender SeniorCitizen Partner Dependents
## 0 0 0 0
## Tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity DeviceProtection StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 0
## Status
## 0
We removed the 11 rows with missing values in TotalCharges
using na.omit() to ensure the dataset is clean and ready
for modeling. Since only a small number of rows are affected, we have
chosen to remove them.
# Summary statistics for numerical variables
summary(Model_Building_Data[, c("Tenure", "MonthlyCharges", "TotalCharges")])
## Tenure MonthlyCharges TotalCharges
## Min. : 1.00 Min. : 18.25 Min. : 18.8
## 1st Qu.: 9.00 1st Qu.: 35.29 1st Qu.: 396.2
## Median :29.00 Median : 70.35 Median :1389.7
## Mean :32.31 Mean : 64.70 Mean :2273.5
## 3rd Qu.:55.00 3rd Qu.: 89.90 3rd Qu.:3773.4
## Max. :72.00 Max. :118.75 Max. :8684.8
# Standard deviation for numeric variables
sapply(Model_Building_Data[, c("Tenure", "MonthlyCharges", "TotalCharges")], sd)
## Tenure MonthlyCharges TotalCharges
## 24.56461 30.18221 2268.75008
# Frequency tables for a few categorical variables
table(Model_Building_Data$Contract)
##
## Month-to-month One year Two year
## 3581 1351 1556
table(Model_Building_Data$PaymentMethod)
##
## Bank transfer (automatic) Credit card (automatic) Electronic check
## 1423 1393 2171
## Mailed check
## 1501
table(Model_Building_Data$InternetService)
##
## DSL Fiber optic No
## 2216 2855 1417
# Load ggplot2 for plotting
library(ggplot2)
# Plot 1: Gender Distribution
# This bar plot shows how many male and female customers are in the dataset.
ggplot(Model_Building_Data, aes(x = Gender)) +
geom_bar(fill = "skyblue") +
labs(title = "Distribution of Gender", x = "Gender", y = "Count") +
theme_minimal()
The dataset has a nearly equal number of male and female customers.
# Distribution of customer status (Current or Left)
ggplot(Model_Building_Data, aes(x = Status)) +
geom_bar(fill = "skyblue") +
labs(title = "Customer Status Distribution", x = "Status", y = "Count") +
theme_minimal()
# Distribution of Senior Citizen status (0 = Not a senior, 1 = Senior)
ggplot(Model_Building_Data, aes(x = SeniorCitizen)) +
geom_bar(fill = "orange") +
labs(title = "Senior Citizen Status", x = "Senior Citizen (0 = No, 1 = Yes)", y = "Count") +
theme_minimal()
# Distribution of Partner status (Yes or No)
ggplot(Model_Building_Data, aes(x = Partner)) +
geom_bar(fill = "mediumorchid") +
labs(title = "Partner Status", x = "Has Partner", y = "Count") +
theme_minimal()
# Distribution of Dependents (Yes or No)
ggplot(Model_Building_Data, aes(x = Dependents)) +
geom_bar(fill = "darkorange") +
labs(title = "Dependents Status", x = "Has Dependents", y = "Count") +
theme_minimal()
# Distribution of tenure
# This histogram shows how long customers have stayed with the company (in months).
ggplot(Model_Building_Data, aes(x = Tenure)) +
geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
labs(title = "Distribution of Customer Tenure", x = "Tenure (Months)", y = "Count") +
theme_minimal()
# Distribution of Phone Service
# This bar plot shows how many customers have phone service versus those who do not.
ggplot(Model_Building_Data, aes(x = PhoneService)) +
geom_bar(fill = "mediumseagreen") +
labs(title = "Distribution of Phone Service", x = "Phone Service", y = "Count") +
theme_minimal()
# Distribution of Multiple Lines
# This bar plot shows how many customers have multiple phone lines, do not have them, or have no phone service.
ggplot(Model_Building_Data, aes(x = MultipleLines)) +
geom_bar(fill = "darkgoldenrod1") +
labs(title = "Distribution of Multiple Lines", x = "Multiple Lines", y = "Count") +
theme_minimal()
# Distribution of Internet Service
# This bar plot shows the types of internet services customers are subscribed to.
ggplot(Model_Building_Data, aes(x = InternetService)) +
geom_bar(fill = "cornflowerblue") +
labs(title = "Distribution of Internet Service", x = "Internet Service Type", y = "Count") +
theme_minimal()
# Distribution of Online Security
# This bar plot shows how many customers have online security, don't have it, or have no internet service.
ggplot(Model_Building_Data, aes(x = OnlineSecurity)) +
geom_bar(fill = "indianred2") +
labs(title = "Distribution of Online Security", x = "Online Security", y = "Count") +
theme_minimal()
# Distribution of Device Protection
# This bar plot shows how many customers have device protection, do not have it, or have no internet service.
ggplot(Model_Building_Data, aes(x = DeviceProtection)) +
geom_bar(fill = "deepskyblue3") +
labs(title = "Distribution of Device Protection", x = "Device Protection", y = "Count") +
theme_minimal()
# Distribution of Contract Types
# This bar plot shows the number of customers based on their contract length.
ggplot(Model_Building_Data, aes(x = Contract)) +
geom_bar(fill = "darkslateblue") +
labs(title = "Distribution of Contract Type", x = "Contract Type", y = "Count") +
theme_minimal()
# Distribution of Paperless Billing
# This bar plot shows how many customers have enabled paperless billing.
ggplot(Model_Building_Data, aes(x = PaperlessBilling)) +
geom_bar(fill = "tomato") +
labs(title = "Distribution of Paperless Billing", x = "Paperless Billing", y = "Count") +
theme_minimal()
# Distribution of Payment Methods
# This bar plot shows the number of customers using each payment method.
ggplot(Model_Building_Data, aes(x = PaymentMethod)) +
geom_bar(fill = "goldenrod") +
labs(title = "Distribution of Payment Methods", x = "Payment Method", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 25, hjust = 1))
# Distribution of Monthly Charges
# This histogram shows how much customers are charged monthly.
ggplot(Model_Building_Data, aes(x = MonthlyCharges)) +
geom_histogram(binwidth = 5, fill = "mediumturquoise", color = "white") +
labs(title = "Distribution of Monthly Charges", x = "Monthly Charges ($)", y = "Count") +
theme_minimal()
# Distribution of Total Charges
# This histogram shows the total amount charged to customers over their entire tenure.
ggplot(Model_Building_Data, aes(x = TotalCharges)) +
geom_histogram(binwidth = 250, fill = "mediumpurple", color = "white") +
labs(title = "Distribution of Total Charges", x = "Total Charges ($)", y = "Count") +
theme_minimal()
The distribution of total charges is highly right-skewed. Most customers have paid under $2,000 over their entire tenure, while a smaller group has significantly higher total charges, likely due to longer relationships or higher service usage.
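That skew can be quantified with the sample skewness (a minimal base-R sketch; the `skew` helper below is our own addition, not part of the original analysis):

```r
# Sample skewness: mean of cubed z-scores; a value well above 0 confirms right skew
skew <- function(x) mean(((x - mean(x)) / sd(x))^3)
skew(Model_Building_Data$TotalCharges)
```

If a later model assumed roughly symmetric inputs, a log transform (e.g., log10(TotalCharges)) would pull in the long right tail.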
Churn refers to the situation when a customer leaves or stops using a company’s service.
In our project, churn is represented by:
Status = “Left” → the customer has churned
Status = “Current” → the customer is still with the company
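Before plotting, the overall churn rate can be read directly from the Status column (a one-line sketch):

```r
# Share of customers in each Status category; the "Left" share is the churn rate
round(100 * prop.table(table(Model_Building_Data$Status)), 1)
```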
# Plot: Distribution of Status (Current vs Left)
ggplot(Model_Building_Data, aes(x = Status)) +
geom_bar(fill = "steelblue") +
labs(title = "Customer Churn Distribution",
x = "Customer Status", y = "Count") +
theme_minimal()
# Plot: Churn distribution by contract type
ggplot(Model_Building_Data, aes(x = Contract, fill = Status)) +
geom_bar(position = "fill") + # This gives proportional stacked bars
labs(title = "Churn Rate by Contract Type",
x = "Contract Type", y = "Proportion") +
theme_minimal()
# Plot: Churn rate by OnlineSecurity status
ggplot(Model_Building_Data, aes(x = OnlineSecurity, fill = Status)) +
geom_bar(position = "fill") + # Proportional stacked bars
labs(title = "Churn Rate by Online Security",
x = "Online Security Service", y = "Proportion") +
theme_minimal()
# Plot: Churn rate by customer tenure
ggplot(Model_Building_Data, aes(x = Tenure, fill = Status)) +
geom_histogram(binwidth = 5, position = "fill") + # stacked by proportion
labs(title = "Churn Rate by Customer Tenure",
x = "Tenure (Months)", y = "Proportion") +
theme_minimal()
# Plot: Churn rate by senior citizen status
ggplot(Model_Building_Data, aes(x = SeniorCitizen, fill = Status)) +
geom_bar(position = "fill") +
labs(title = "Churn Rate by Senior Citizen Status",
x = "Senior Citizen (0 = No, 1 = Yes)", y = "Proportion") +
theme_minimal()
# Plot: Churn rate by type of Internet Service
ggplot(Model_Building_Data, aes(x = InternetService, fill = Status)) +
geom_bar(position = "fill") + # Shows proportion of churn per service type
labs(title = "Churn Rate by Internet Service Type",
x = "Internet Service", y = "Proportion") +
theme_minimal()
# Set seed for reproducibility
set.seed(123)
# Generate sample row indices for training (80% of the data)
train_indices <- sample(nrow(Model_Building_Data), 0.8 * nrow(Model_Building_Data))
# Split the dataset
train_data <- Model_Building_Data[train_indices, ]
test_data <- Model_Building_Data[-train_indices, ]
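Note that sample() draws rows without regard to Status, so the class balance can drift slightly between the splits. A quick sanity check (our own addition, not part of the original analysis):

```r
# Compare the churn proportion in the two splits; they should be close
prop.table(table(train_data$Status))
prop.table(table(test_data$Status))
```

If an exactly stratified split were required, caret::createDataPartition() (assuming the caret package is available) preserves class proportions by construction.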
Model 1
# Logistic regression model using Status as the response variable
log_model_full <- glm(Status ~ SeniorCitizen + Partner + Dependents + Tenure + PhoneService +
MultipleLines + InternetService + OnlineSecurity + DeviceProtection +
Contract + PaperlessBilling + PaymentMethod + MonthlyCharges +
TotalCharges + Gender,
data = train_data,
family = "binomial")
summary(log_model_full)
##
## Call:
## glm(formula = Status ~ SeniorCitizen + Partner + Dependents +
## Tenure + PhoneService + MultipleLines + InternetService +
## OnlineSecurity + DeviceProtection + Contract + PaperlessBilling +
## PaymentMethod + MonthlyCharges + TotalCharges + Gender, family = "binomial",
## data = train_data)
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.363e-02 2.722e-01 0.307 0.758630
## SeniorCitizen1 2.554e-01 9.699e-02 2.634 0.008443 **
## PartnerYes 1.543e-02 9.129e-02 0.169 0.865816
## DependentsYes -2.181e-01 1.060e-01 -2.057 0.039660 *
## Tenure -6.819e-02 7.395e-03 -9.221 < 2e-16 ***
## PhoneServiceYes -9.212e-01 1.804e-01 -5.106 3.30e-07 ***
## MultipleLinesNo phone service NA NA NA NA
## MultipleLinesYes 2.287e-01 9.612e-02 2.379 0.017374 *
## InternetServiceFiber optic 6.972e-01 1.698e-01 4.105 4.04e-05 ***
## InternetServiceNo -4.297e-01 2.256e-01 -1.904 0.056887 .
## OnlineSecurityNo internet service NA NA NA NA
## OnlineSecurityYes -5.894e-01 1.017e-01 -5.793 6.92e-09 ***
## DeviceProtectionNo internet service NA NA NA NA
## DeviceProtectionYes -1.862e-01 9.982e-02 -1.865 0.062138 .
## ContractOne year -6.593e-01 1.277e-01 -5.165 2.41e-07 ***
## ContractTwo year -1.306e+00 1.987e-01 -6.571 4.98e-11 ***
## PaperlessBillingYes 2.766e-01 8.621e-02 3.209 0.001334 **
## PaymentMethodCredit card (automatic) -1.345e-01 1.347e-01 -0.999 0.317864
## PaymentMethodElectronic check 4.146e-01 1.106e-01 3.749 0.000177 ***
## PaymentMethodMailed check -8.033e-02 1.350e-01 -0.595 0.551723
## MonthlyCharges 6.082e-03 5.336e-03 1.140 0.254335
## TotalCharges 4.083e-04 8.289e-05 4.926 8.38e-07 ***
## GenderMale -1.404e-02 7.556e-02 -0.186 0.852555
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6019.5 on 5189 degrees of freedom
## Residual deviance: 4292.1 on 5170 degrees of freedom
## AIC: 4332.1
##
## Number of Fisher Scoring iterations: 6
Significant factors associated with customers discontinuing service:
Shorter tenure: Customers with fewer months of service are more likely to leave.
Fiber optic internet: Customers using this service are more likely to discontinue.
No phone service: Associated with a higher chance of leaving.
No online security: Strongly linked to customer departure.
Month-to-month contracts: Customers on these plans are more likely to discontinue, compared to those with one- or two-year contracts.
Electronic check payment: This method is associated with a higher departure rate.
Total charges: The coefficient is positive and significant, so with tenure held constant, higher total charges are associated with a greater chance of leaving.
Factors that were not statistically significant:
Gender
Partner status
Having dependents
Mailed check payment
Credit card payment
Device protection
Note: Some factor levels (e.g., “No phone service” under MultipleLines) were automatically excluded by R due to multicollinearity with other variables. These exclusions are expected and do not affect model integrity.
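Because the coefficients above are on the log-odds scale, exponentiating them yields odds ratios, which are easier to communicate to a business audience:

```r
# Odds ratios: values > 1 raise the odds of leaving, values < 1 lower them.
# For example, ContractTwo year has exp(-1.306) ≈ 0.27, i.e. a two-year
# contract cuts the odds of leaving by roughly 73% relative to month-to-month.
round(exp(coef(log_model_full)), 3)
```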
# Predict probabilities from the full model
predicted_probs <- predict(log_model_full, newdata = test_data, type = "response")
# Load the pROC package
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Generate ROC curve and calculate AUC
roc_obj <- roc(test_data$Status, predicted_probs)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
# Print AUC
auc(roc_obj)
## Area under the curve: 0.8446
# Plot ROC curve
plot(roc_obj, col = "blue", main = "ROC Curve - Full Logistic Regression Model")
Model 2
# Model 2
log_model_reduced <- glm(Status ~ SeniorCitizen + Tenure + PhoneService + MultipleLines +
InternetService + OnlineSecurity + Contract + PaperlessBilling +
PaymentMethod + MonthlyCharges + TotalCharges,
data = train_data,
family = "binomial")
summary(log_model_reduced)
##
## Call:
## glm(formula = Status ~ SeniorCitizen + Tenure + PhoneService +
## MultipleLines + InternetService + OnlineSecurity + Contract +
## PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges,
## family = "binomial", data = train_data)
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.245e-01 2.644e-01 0.471 0.637643
## SeniorCitizen1 2.907e-01 9.522e-02 3.053 0.002265 **
## Tenure -6.932e-02 7.392e-03 -9.378 < 2e-16 ***
## PhoneServiceYes -8.410e-01 1.744e-01 -4.822 1.42e-06 ***
## MultipleLinesNo phone service NA NA NA NA
## MultipleLinesYes 2.521e-01 9.512e-02 2.650 0.008046 **
## InternetServiceFiber optic 8.072e-01 1.615e-01 4.997 5.81e-07 ***
## InternetServiceNo -5.271e-01 2.205e-01 -2.391 0.016811 *
## OnlineSecurityNo internet service NA NA NA NA
## OnlineSecurityYes -5.711e-01 1.007e-01 -5.669 1.44e-08 ***
## ContractOne year -6.808e-01 1.273e-01 -5.349 8.87e-08 ***
## ContractTwo year -1.349e+00 1.979e-01 -6.816 9.37e-12 ***
## PaperlessBillingYes 2.849e-01 8.602e-02 3.312 0.000926 ***
## PaymentMethodCredit card (automatic) -1.305e-01 1.345e-01 -0.971 0.331685
## PaymentMethodElectronic check 4.293e-01 1.104e-01 3.890 0.000100 ***
## PaymentMethodMailed check -7.842e-02 1.347e-01 -0.582 0.560524
## MonthlyCharges 1.968e-03 4.904e-03 0.401 0.688137
## TotalCharges 4.154e-04 8.306e-05 5.001 5.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6019.5 on 5189 degrees of freedom
## Residual deviance: 4300.5 on 5174 degrees of freedom
## AIC: 4332.5
##
## Number of Fisher Scoring iterations: 6
# Predict probabilities
reduced_probs <- predict(log_model_reduced, newdata = test_data, type = "response")
# AUC with pROC
library(pROC)
roc_reduced <- roc(test_data$Status, reduced_probs)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
auc(roc_reduced)
## Area under the curve: 0.8465
# Plot ROC
plot(roc_reduced, col = "darkgreen", main = "ROC Curve - Reduced Model")
After comparing the two models, the reduced logistic regression model (Model 2) achieved the highest AUC (0.8465, vs. 0.8446 for the full model), so we will proceed with it.
# Step 1: Predict probabilities from logistic regression model
log_probs <- predict(log_model_reduced, newdata = test_data, type = "response")
# Threshold = 0.4
log_preds_th0.4 <- ifelse(log_probs > 0.4, "Left", "Current")
cat("Confusion Matrix – Threshold 0.4:\n")
## Confusion Matrix – Threshold 0.4:
print(table(log_preds_th0.4, test_data$Status))
##
## log_preds_th0.4 Current Left
## Current 813 122
## Left 138 225
cat("Prediction Accuracy (0.4):", mean(log_preds_th0.4 == test_data$Status), "\n\n")
## Prediction Accuracy (0.4): 0.7996918
# Threshold = 0.5
log_preds_th0.5 <- ifelse(log_probs > 0.5, "Left", "Current")
cat("Confusion Matrix – Threshold 0.5:\n")
## Confusion Matrix – Threshold 0.5:
print(table(log_preds_th0.5, test_data$Status))
##
## log_preds_th0.5 Current Left
## Current 862 167
## Left 89 180
cat("Prediction Accuracy (0.5):", mean(log_preds_th0.5 == test_data$Status), "\n\n")
## Prediction Accuracy (0.5): 0.8027735
# Threshold = 0.6
log_preds_th0.6 <- ifelse(log_probs > 0.6, "Left", "Current")
cat("Confusion Matrix – Threshold 0.6:\n")
## Confusion Matrix – Threshold 0.6:
print(table(log_preds_th0.6, test_data$Status))
##
## log_preds_th0.6 Current Left
## Current 907 225
## Left 44 122
cat("Prediction Accuracy (0.6):", mean(log_preds_th0.6 == test_data$Status), "\n\n")
## Prediction Accuracy (0.6): 0.7927581
# Create summary data frame
logit_threshold_df <- data.frame(
Threshold = c(0.4, 0.5, 0.6),
Accuracy = c(0.7996918, 0.8027735, 0.7927581)
)
# View the table
print(logit_threshold_df)
## Threshold Accuracy
## 1 0.4 0.7996918
## 2 0.5 0.8027735
## 3 0.6 0.7927581
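The three thresholds can also be evaluated in a single pass rather than in repeated blocks (a compact sketch reusing log_probs and test_data from above):

```r
# Accuracy at each candidate cutoff, computed with sapply()
thresholds <- c(0.4, 0.5, 0.6)
accuracy <- sapply(thresholds, function(th) {
  preds <- ifelse(log_probs > th, "Left", "Current")
  mean(preds == test_data$Status)
})
data.frame(Threshold = thresholds, Accuracy = accuracy)
```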
We evaluated the reduced logistic regression model (log_model_reduced; Model 2) at multiple probability thresholds to determine the best cutoff for prediction. As the table above shows, a threshold of 0.5 yielded the highest prediction accuracy, 80.28%.
# Load the package to access Naive Bayes
library(e1071)
library(ggplot2)
# Histogram for Tenure
ggplot(Model_Building_Data) +
geom_histogram(aes(x = Tenure), bins = 30, fill = "steelblue", color = "white") +
labs(title = "Distribution of Tenure", x = "Tenure (Months)", y = "Count")
# Histogram for MonthlyCharges
ggplot(Model_Building_Data) +
geom_histogram(aes(x = MonthlyCharges), bins = 30, fill = "darkorange", color = "white") +
labs(title = "Distribution of Monthly Charges", x = "Monthly Charges ($)", y = "Count")
# Histogram for TotalCharges
ggplot(Model_Building_Data) +
geom_histogram(aes(x = TotalCharges), bins = 30, fill = "seagreen", color = "white") +
labs(title = "Distribution of Total Charges", x = "Total Charges ($)", y = "Count")
# create binned variables after train-test split
train_data <- train_data %>%
mutate(
Tenure_binned = ifelse(Tenure < 20, "Short",
ifelse(Tenure > 50, "Long", "Medium")),
MonthlyCharges_binned = ifelse(MonthlyCharges < 35, "Low",
ifelse(MonthlyCharges > 85, "High", "Medium")),
TotalCharges_binned = ifelse(TotalCharges < 1500, "Low",
ifelse(TotalCharges > 5000, "High", "Medium"))
) %>%
mutate(
Tenure_binned = as.factor(Tenure_binned),
MonthlyCharges_binned = as.factor(MonthlyCharges_binned),
TotalCharges_binned = as.factor(TotalCharges_binned)
)
test_data <- test_data %>%
mutate(
Tenure_binned = ifelse(Tenure < 20, "Short",
ifelse(Tenure > 50, "Long", "Medium")),
MonthlyCharges_binned = ifelse(MonthlyCharges < 35, "Low",
ifelse(MonthlyCharges > 85, "High", "Medium")),
TotalCharges_binned = ifelse(TotalCharges < 1500, "Low",
ifelse(TotalCharges > 5000, "High", "Medium"))
) %>%
mutate(
Tenure_binned = as.factor(Tenure_binned),
MonthlyCharges_binned = as.factor(MonthlyCharges_binned),
TotalCharges_binned = as.factor(TotalCharges_binned)
)
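The nested ifelse() calls above work, but base R's cut() expresses the same binning more compactly. Since Tenure is an integer, breaks at 19 and 50 reproduce the ifelse logic exactly (Short: below 20, Medium: 20 to 50, Long: above 50); a sketch:

```r
# Equivalent binning with cut(); intervals are (-Inf,19], (19,50], (50,Inf]
tenure_bins <- cut(train_data$Tenure,
                   breaks = c(-Inf, 19, 50, Inf),
                   labels = c("Short", "Medium", "Long"))
table(tenure_bins)
```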
str(Model_Building_Data)
## 'data.frame': 6488 obs. of 17 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ Tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Status : Factor w/ 2 levels "Current","Left": 1 1 2 1 2 2 1 1 2 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:11] 459 700 873 1007 1249 3090 3546 4067 4844 6196 ...
## ..- attr(*, "names")= chr [1:11] "489" "754" "937" "1083" ...
Model N1
# Full Naive Bayes model with all relevant variables
nb_model_full <- naiveBayes(Status ~ SeniorCitizen + Partner + Dependents + PhoneService +
MultipleLines + InternetService + OnlineSecurity + DeviceProtection +
StreamingMovies + Contract + PaperlessBilling + PaymentMethod + Gender +
Tenure_binned + MonthlyCharges_binned + TotalCharges_binned,
data = train_data,
laplace = 0.01)
# View summary
nb_model_full
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Current Left
## 0.7333333 0.2666667
##
## Conditional probabilities:
## SeniorCitizen
## Y 0 1
## Current 0.8707285 0.1292715
## Left 0.7377133 0.2622867
##
## Partner
## Y No Yes
## Current 0.4753023 0.5246977
## Left 0.6473967 0.3526033
##
## Dependents
## Y No Yes
## Current 0.6563313 0.3436687
## Left 0.8345327 0.1654673
##
## PhoneService
## Y No Yes
## Current 0.09695430 0.90304570
## Left 0.08960131 0.91039869
##
## MultipleLines
## Y No No phone service Yes
## Current 0.49474387 0.09695404 0.40830209
## Left 0.45953484 0.08960066 0.45086450
##
## InternetService
## Y DSL Fiber optic No
## Current 0.37283206 0.35128730 0.27588064
## Left 0.23266114 0.70519425 0.06214461
##
## OnlineSecurity
## Y No No internet service Yes
## Current 0.39174941 0.27588064 0.33236995
## Left 0.78900746 0.06214461 0.14884793
##
## DeviceProtection
## Y No No internet service Yes
## Current 0.36100872 0.27588064 0.36311064
## Left 0.65822995 0.06214461 0.27962544
##
## StreamingMovies
## Y No No internet service Yes
## Current 0.34997359 0.27588064 0.37414576
## Left 0.50433155 0.06214461 0.43352384
##
## Contract
## Y Month-to-month One year Two year
## Current 0.43326248 0.24855558 0.31818194
## Left 0.88799376 0.08382044 0.02818581
##
## PaperlessBilling
## Y No Yes
## Current 0.4611142 0.5388858
## Left 0.2608416 0.7391584
##
## PaymentMethod
## Y Bank transfer (automatic) Credit card (automatic) Electronic check
## Current 0.2451393 0.2490804 0.2482922
## Left 0.1315063 0.1170559 0.5895856
## PaymentMethod
## Y Mailed check
## Current 0.2574881
## Left 0.1618523
##
## Gender
## Y Female Male
## Current 0.4926432 0.5073568
## Left 0.5072253 0.4927747
##
## Tenure_binned
## Y Long Medium Short
## Current 0.3691537 0.3258014 0.3050449
## Left 0.1011611 0.2290485 0.6697904
##
## MonthlyCharges_binned
## Y High Low Medium
## Current 0.2777198 0.2992646 0.4230156
## Left 0.3995650 0.1033287 0.4971063
##
## TotalCharges_binned
## Y High Low Medium
## Current 0.18444679 0.46952073 0.34603248
## Left 0.08237538 0.67268051 0.24494411
library(pROC)
# Predict probabilities for "Left"
nb_probs_full <- predict(nb_model_full, newdata = test_data, type = "raw")[, "Left"]
# Compute ROC object
nb_roc_full <- roc(response = test_data$Status,
predictor = nb_probs_full,
levels = c("Current", "Left"),
direction = "<")
# Plot the ROC curve
plot(nb_roc_full, main = "Naive Bayes – Full Model ROC Curve", col = "darkred", lwd = 3)
# Show AUC value
auc(nb_roc_full)
## Area under the curve: 0.8239
Model N2
# Train Naive Bayes model
model_r <- naiveBayes(Status ~ Tenure_binned + MonthlyCharges_binned + TotalCharges_binned +
Contract + InternetService + OnlineSecurity + MultipleLines +
PhoneService + PaperlessBilling + PaymentMethod +
SeniorCitizen + StreamingMovies + Partner,
data = train_data, laplace = 0.01)
model_r
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Current Left
## 0.7333333 0.2666667
##
## Conditional probabilities:
## Tenure_binned
## Y Long Medium Short
## Current 0.3691537 0.3258014 0.3050449
## Left 0.1011611 0.2290485 0.6697904
##
## MonthlyCharges_binned
## Y High Low Medium
## Current 0.2777198 0.2992646 0.4230156
## Left 0.3995650 0.1033287 0.4971063
##
## TotalCharges_binned
## Y High Low Medium
## Current 0.18444679 0.46952073 0.34603248
## Left 0.08237538 0.67268051 0.24494411
##
## Contract
## Y Month-to-month One year Two year
## Current 0.43326248 0.24855558 0.31818194
## Left 0.88799376 0.08382044 0.02818581
##
## InternetService
## Y DSL Fiber optic No
## Current 0.37283206 0.35128730 0.27588064
## Left 0.23266114 0.70519425 0.06214461
##
## OnlineSecurity
## Y No No internet service Yes
## Current 0.39174941 0.27588064 0.33236995
## Left 0.78900746 0.06214461 0.14884793
##
## MultipleLines
## Y No No phone service Yes
## Current 0.49474387 0.09695404 0.40830209
## Left 0.45953484 0.08960066 0.45086450
##
## PhoneService
## Y No Yes
## Current 0.09695430 0.90304570
## Left 0.08960131 0.91039869
##
## PaperlessBilling
## Y No Yes
## Current 0.4611142 0.5388858
## Left 0.2608416 0.7391584
##
## PaymentMethod
## Y Bank transfer (automatic) Credit card (automatic) Electronic check
## Current 0.2451393 0.2490804 0.2482922
## Left 0.1315063 0.1170559 0.5895856
## PaymentMethod
## Y Mailed check
## Current 0.2574881
## Left 0.1618523
##
## SeniorCitizen
## Y 0 1
## Current 0.8707285 0.1292715
## Left 0.7377133 0.2622867
##
## StreamingMovies
## Y No No internet service Yes
## Current 0.34997359 0.27588064 0.37414576
## Left 0.50433155 0.06214461 0.43352384
##
## Partner
## Y No Yes
## Current 0.4753023 0.5246977
## Left 0.6473967 0.3526033
# Predict probabilities for "Left"
probs_r <- predict(model_r, newdata = test_data, type = "raw")[, "Left"]
# Compute ROC object
roc_r <- roc(response = test_data$Status,
predictor = probs_r,
levels = c("Current", "Left"),
direction = "<")
# Plot the ROC curve
plot(roc_r, main = "Naive Bayes – Model R", col = "darkgreen", lwd = 3)
# Show AUC value
auc(roc_r)
## Area under the curve: 0.8316
AUC for this model = 0.8316.
We will use model_r (after removing some variables to achieve the highest AUC) for further analysis.
# Step 1: Get posterior probabilities from Model R
probs_r_all <- predict(model_r, newdata = test_data, type = "raw")
probs_r_left <- probs_r_all[, "Left"] # probability of customer leaving
# Threshold = 0.5
pred_r_th0.5 <- ifelse(probs_r_left > 0.5, "Left", "Current")
cat("Confusion Matrix – Threshold 0.5:\n")
## Confusion Matrix – Threshold 0.5:
print(table(pred_r_th0.5, test_data$Status))
##
## pred_r_th0.5 Current Left
## Current 736 87
## Left 215 260
cat("Prediction Accuracy (0.5):", mean(pred_r_th0.5 == test_data$Status), "\n\n")
## Prediction Accuracy (0.5): 0.7673344
# Threshold = 0.7
pred_r_th0.7 <- ifelse(probs_r_left > 0.7, "Left", "Current")
cat("Confusion Matrix – Threshold 0.7:\n")
## Confusion Matrix – Threshold 0.7:
print(table(pred_r_th0.7, test_data$Status))
##
## pred_r_th0.7 Current Left
## Current 804 130
## Left 147 217
cat("Prediction Accuracy (0.7):", mean(pred_r_th0.7 == test_data$Status), "\n\n")
## Prediction Accuracy (0.7): 0.7865948
# Threshold = 0.8
pred_r_th0.8 <- ifelse(probs_r_left > 0.8, "Left", "Current")
cat("Confusion Matrix – Threshold 0.8:\n")
## Confusion Matrix – Threshold 0.8:
print(table(pred_r_th0.8, test_data$Status))
##
## pred_r_th0.8 Current Left
## Current 837 158
## Left 114 189
cat("Prediction Accuracy (0.8):", mean(pred_r_th0.8 == test_data$Status), "\n\n")
## Prediction Accuracy (0.8): 0.7904468
# Create data frame with prediction accuracy at each threshold
nb_threshold_results <- data.frame(
Threshold = c(0.5, 0.7, 0.8),
Accuracy = c(0.7673344, 0.7865948, 0.7904468)
)
# Print the summary
print(nb_threshold_results)
## Threshold Accuracy
## 1 0.5 0.7673344
## 2 0.7 0.7865948
## 3 0.8 0.7904468
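The three thresholds above were hand-coded; the same comparison can be run over a whole grid with one small helper. A minimal base-R sketch, using simulated probabilities and labels as stand-ins for `probs_r_left` and `test_data$Status`:

```r
# Sketch only: sweep a grid of thresholds instead of hand-coding each one.
# `probs` and `actual` are simulated stand-ins for probs_r_left and
# test_data$Status from the chunks above.
set.seed(1)
n <- 200
actual <- factor(sample(c("Current", "Left"), n, replace = TRUE, prob = c(0.73, 0.27)))
probs  <- ifelse(actual == "Left", rbeta(n, 4, 2), rbeta(n, 2, 4))

accuracy_at <- function(th) {
  pred <- ifelse(probs > th, "Left", "Current")
  mean(pred == actual)
}

thresholds <- seq(0.3, 0.8, by = 0.05)
acc <- sapply(thresholds, accuracy_at)
data.frame(Threshold = thresholds, Accuracy = round(acc, 4))
```

The same loop applied to the real posterior probabilities would reproduce the 0.5/0.7/0.8 rows above and show where accuracy peaks between them.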
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Build the full LDA model
lda_model_full <- lda(Status ~ SeniorCitizen + Partner + Dependents + Tenure +
PhoneService + MultipleLines + InternetService +
OnlineSecurity + DeviceProtection + Contract +
PaperlessBilling + PaymentMethod + MonthlyCharges +
TotalCharges + Gender,
data = train_data)
## Warning in lda.default(x, grouping, ...): variables are collinear
# Predict on test data
lda_pred_full <- predict(lda_model_full, newdata = test_data)
# Extract probabilities for class "Left"
probs_full <- lda_pred_full$posterior[, "Left"]
# Generate ROC curve
roc_full <- roc(response = test_data$Status,
predictor = probs_full,
levels = c("Current", "Left"),
direction = "<")
# Plot ROC curve
plot(roc_full, main = "ROC – Full LDA Model", col = "purple", lwd = 2)
# Show AUC
auc(roc_full)
## Area under the curve: 0.8381
# This will select only numeric columns from train_data
numeric_vars <- train_data %>%
select_if(is.numeric)
# View names of numeric variables
names(numeric_vars)
## [1] "Tenure" "MonthlyCharges" "TotalCharges"
# Compute correlation matrix
cor_matrix <- cor(numeric_vars)
# View correlation matrix
round(cor_matrix, 2)
## Tenure MonthlyCharges TotalCharges
## Tenure 1.00 0.24 0.82
## MonthlyCharges 0.24 1.00 0.65
## TotalCharges 0.82 0.65 1.00
The correlation between Tenure and TotalCharges is very high (0.82), indicating potential multicollinearity.
Similarly, TotalCharges is moderately correlated with MonthlyCharges (0.65).
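A small helper can flag every highly correlated pair automatically instead of scanning the matrix by eye. A sketch with simulated columns standing in for `train_data`'s numeric variables (the 0.8 cutoff is our choice, not a rule):

```r
# Sketch only: flag predictor pairs whose absolute correlation exceeds a
# cutoff. Simulated columns stand in for train_data's numeric variables;
# TotalCharges is built to track Tenure closely.
set.seed(2)
tenure  <- rpois(500, 30)
monthly <- runif(500, 20, 110)
total   <- tenure * 80 + rnorm(500, 0, 100)
num <- data.frame(Tenure = tenure, MonthlyCharges = monthly, TotalCharges = total)

high_cor_pairs <- function(df, cutoff = 0.8) {
  cm  <- cor(df)
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(Var1        = rownames(cm)[idx[, 1]],
             Var2        = colnames(cm)[idx[, 2]],
             Correlation = round(cm[idx], 2))
}
high_cor_pairs(num)   # flags the Tenure / TotalCharges pair
```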
Handling categorical variables
table(train_data$PhoneService, train_data$MultipleLines)
##
## No No phone service Yes
## No 0 493 0
## Yes 2519 0 2178
table(train_data$InternetService, train_data$OnlineSecurity)
##
## No No internet service Yes
## DSL 892 0 849
## Fiber optic 1691 0 622
## No 0 1136 0
There is perfect multicollinearity between some service-related variables (e.g., MultipleLines and PhoneService, OnlineSecurity and InternetService): the "No phone service" and "No internet service" levels are fully determined by the parent variable, as the cross-tabulations above show.
Handling Multicollinearity in LDA Model
While building the Linear Discriminant Analysis (LDA) model, we encountered the warning "variables are collinear".
To address multicollinearity among the highly correlated variables (Tenure, MonthlyCharges, and TotalCharges), we avoid including all of them in the same model. Instead, we experiment with different combinations, adding one while removing others, to find the model with the highest AUC while preserving stability and interpretability.
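The AUC values in this document come from pROC::auc(). As a sanity check, AUC is also the probability that a randomly chosen "Left" customer receives a higher predicted score than a randomly chosen "Current" customer, which can be computed from ranks alone. A base-R sketch (`auc_score` is our own helper, not a pROC function):

```r
# Sketch only: rank-based AUC (the Mann-Whitney statistic). For a score
# vector and binary labels this matches pROC::auc; auc_score is our own
# helper, not part of pROC.
auc_score <- function(scores, labels, positive = "Left") {
  pos   <- labels == positive
  r     <- rank(scores)                 # average ranks handle ties
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# A perfectly separated toy example: every "Left" outscores every "Current"
auc_score(c(0.1, 0.2, 0.8, 0.9), c("Current", "Current", "Left", "Left"))
# 1
```

Applied to each candidate model's posterior probabilities, this gives the same ranking of models as the ROC plots below.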
# Model R2: Keep MonthlyCharges + TotalCharges, remove Tenure
lda_model_r2 <- lda(Status ~ SeniorCitizen + Partner + Dependents +
PhoneService + InternetService + Contract +
PaperlessBilling + PaymentMethod + MonthlyCharges +
TotalCharges + Gender,
data = train_data)
lda_pred_r2 <- predict(lda_model_r2, newdata = test_data)
probs_r <- lda_pred_r2$posterior[, "Left"]
roc_r2 <- roc(response = test_data$Status,
predictor = probs_r,
levels = c("Current", "Left"),
direction = "<")
plot(roc_r2, main = "ROC – LDA Model R2", col = "darkgreen", lwd = 2)
auc(roc_r2)
## Area under the curve: 0.8402
# Model R3: Keep Tenure + TotalCharges, remove MonthlyCharges
lda_model_r3 <- lda(Status ~ SeniorCitizen + Partner + Dependents +
PhoneService + InternetService + Contract +
PaperlessBilling + PaymentMethod + Tenure +
TotalCharges + Gender,
data = train_data)
lda_pred_r3 <- predict(lda_model_r3, newdata = test_data)
probs_r3 <- lda_pred_r3$posterior[, "Left"]
roc_r3 <- roc(response = test_data$Status,
predictor = probs_r3,
levels = c("Current", "Left"),
direction = "<")
plot(roc_r3, main = "ROC – LDA Model R3", col = "orange", lwd = 2)
auc(roc_r3)
## Area under the curve: 0.839
# Model R4: Keep Tenure, MonthlyCharges, and TotalCharges; drop Dependents
lda_model_r4 <- lda(Status ~ SeniorCitizen + Partner +
PhoneService + InternetService + Contract +
PaperlessBilling + PaymentMethod + Tenure +
MonthlyCharges + TotalCharges + Gender,
data = train_data)
lda_pred_r4 <- predict(lda_model_r4, newdata = test_data)
probs_r4 <- lda_pred_r4$posterior[, "Left"]
roc_r4 <- roc(response = test_data$Status,
predictor = probs_r4,
levels = c("Current", "Left"),
direction = "<")
plot(roc_r4, main = "ROC – LDA Model R4", col = "red", lwd = 2)
auc(roc_r4)
## Area under the curve: 0.8405
We will proceed with the model with the highest AUC, which here is lda_model_r4 (AUC = 0.8405).
Summary output for lda_model_r4
lda_model_r4
## Call:
## lda(Status ~ SeniorCitizen + Partner + PhoneService + InternetService +
## Contract + PaperlessBilling + PaymentMethod + Tenure + MonthlyCharges +
## TotalCharges + Gender, data = train_data)
##
## Prior probabilities of groups:
## Current Left
## 0.7333333 0.2666667
##
## Group means:
## SeniorCitizen1 PartnerYes PhoneServiceYes InternetServiceFiber optic
## Current 0.1292696 0.5246978 0.9030478 0.3512874
## Left 0.2622832 0.3526012 0.9104046 0.7052023
## InternetServiceNo ContractOne year ContractTwo year PaperlessBillingYes
## Current 0.27588019 0.24855491 0.31818182 0.5388860
## Left 0.06213873 0.08381503 0.02817919 0.7391618
## PaymentMethodCredit card (automatic) PaymentMethodElectronic check
## Current 0.2490804 0.2482922
## Left 0.1170520 0.5895954
## PaymentMethodMailed check Tenure MonthlyCharges TotalCharges
## Current 0.2574882 37.46926 61.26146 2533.272
## Left 0.1618497 17.80347 74.55751 1528.829
## GenderMale
## Current 0.5073568
## Left 0.4927746
##
## Coefficients of linear discriminants:
## LD1
## SeniorCitizen1 0.3207039478
## PartnerYes -0.0558106716
## PhoneServiceYes -0.4188408378
## InternetServiceFiber optic 0.8382195042
## InternetServiceNo -0.0758157927
## ContractOne year -0.5973819289
## ContractTwo year -0.4424887962
## PaperlessBillingYes 0.2288347463
## PaymentMethodCredit card (automatic) -0.0477659464
## PaymentMethodElectronic check 0.4962240694
## PaymentMethodMailed check -0.0716122280
## Tenure -0.0100363576
## MonthlyCharges 0.0117782650
## TotalCharges -0.0002257877
## GenderMale -0.0100234500
# Step 1: Get posterior probabilities from Model R4
lda_pred_r4 <- predict(lda_model_r4, newdata = test_data)
probs_r4_left <- lda_pred_r4$posterior[, "Left"] # Probabilities for "Left" class
# --- Threshold = 0.4 ---
pred_r4_th0.4 <- ifelse(probs_r4_left > 0.4, "Left", "Current")
cat("Confusion Matrix – Threshold 0.4:\n")
## Confusion Matrix – Threshold 0.4:
print(table(pred_r4_th0.4, test_data$Status))
##
## pred_r4_th0.4 Current Left
## Current 825 135
## Left 126 212
cat("Prediction Accuracy (0.4):", mean(pred_r4_th0.4 == test_data$Status), "\n\n")
## Prediction Accuracy (0.4): 0.7989214
# --- Threshold = 0.5 ---
pred_r4_th0.5 <- ifelse(probs_r4_left > 0.5, "Left", "Current")
cat("Confusion Matrix – Threshold 0.5:\n")
## Confusion Matrix – Threshold 0.5:
print(table(pred_r4_th0.5, test_data$Status))
##
## pred_r4_th0.5 Current Left
## Current 862 173
## Left 89 174
cat("Prediction Accuracy (0.5):", mean(pred_r4_th0.5 == test_data$Status), "\n\n")
## Prediction Accuracy (0.5): 0.798151
# --- Threshold = 0.6 ---
pred_r4_th0.6 <- ifelse(probs_r4_left > 0.6, "Left", "Current")
cat("Confusion Matrix – Threshold 0.6:\n")
## Confusion Matrix – Threshold 0.6:
print(table(pred_r4_th0.6, test_data$Status))
##
## pred_r4_th0.6 Current Left
## Current 892 208
## Left 59 139
cat("Prediction Accuracy (0.6):", mean(pred_r4_th0.6 == test_data$Status), "\n\n")
## Prediction Accuracy (0.6): 0.7942989
# Create a data frame to summarize thresholds and accuracy
lda_threshold_results <- data.frame(
Threshold = c(0.4, 0.5, 0.6),
Accuracy = c(0.7989214, 0.798151, 0.7942989)
)
# View the summary table
print(lda_threshold_results)
## Threshold Accuracy
## 1 0.4 0.7989214
## 2 0.5 0.7981510
## 3 0.6 0.7942989
We encountered the error "rank deficiency in group Current" while building the full QDA model. This typically occurs when predictors are highly correlated or when some factor levels have very few observations within a group. Because Quadratic Discriminant Analysis (QDA) estimates a separate covariance matrix for each class, it is especially sensitive to multicollinearity and to low variation in predictor levels. To resolve this, we removed the problematic variables so the model runs without errors.
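One quick diagnostic (our own sketch, not a MASS function) is to cross-tabulate each factor against the class label before fitting QDA; cells with very few observations in a group are a common source of the rank-deficiency error. Simulated data stands in for `train_data`:

```r
# Sketch only: cross-tabulate a factor against the class label to spot
# levels that are too thin within a group, a common cause of QDA's
# rank-deficiency error. Simulated data stands in for train_data.
set.seed(3)
status  <- factor(sample(c("Current", "Left"), 300, replace = TRUE))
service <- factor(sample(c("DSL", "Fiber optic", "No"), 300, replace = TRUE,
                         prob = c(0.45, 0.45, 0.10)))

sparse_levels <- function(x, group, min_n = 5) {
  tab <- table(group, x)
  which(tab < min_n, arr.ind = TRUE)   # (group, level) cells below min_n
}
sparse_levels(service, status)
```

Any flagged cell suggests collapsing rare levels or dropping the variable before fitting QDA; the `min_n = 5` floor is an assumption of ours, not a hard rule.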
# Model Q1
qda_q1 <- qda(Status ~ SeniorCitizen + Partner + Dependents + PhoneService +
InternetService + Contract + PaperlessBilling +
PaymentMethod + MonthlyCharges + Gender,
data = train_data)
# Predict probabilities from Q1
qda_pred_q1 <- predict(qda_q1, newdata = test_data)
qda_probs_q1 <- qda_pred_q1$posterior[, "Left"]
# ROC & AUC for Q1
roc_q1 <- roc(response = test_data$Status,
predictor = qda_probs_q1,
levels = c("Current", "Left"),
direction = "<")
# Plot ROC curve
plot(roc_q1, main = "ROC Curve – QDA Model Q1", col = "darkblue", lwd = 2)
# Print AUC
auc(roc_q1)
## Area under the curve: 0.8266
# Model Q2
qda_q2 <- qda(Status ~ SeniorCitizen + Partner + Dependents + PhoneService +
InternetService + Contract + PaperlessBilling +
PaymentMethod + Tenure + Gender,
data = train_data)
# Predict probabilities from Q2
qda_pred_q2 <- predict(qda_q2, newdata = test_data)
qda_probs_q2 <- qda_pred_q2$posterior[, "Left"]
# ROC & AUC for Q2
roc_q2 <- roc(response = test_data$Status,
predictor = qda_probs_q2,
levels = c("Current", "Left"),
direction = "<")
# Plot ROC curve
plot(roc_q2, main = "ROC Curve – QDA Model Q2", col = "darkgreen", lwd = 2)
# Print AUC
auc(roc_q2)
## Area under the curve: 0.8391
# Model Q3
qda_q3 <- qda(Status ~ SeniorCitizen + Partner + Dependents + PhoneService +
InternetService + Contract + PaperlessBilling +
PaymentMethod + MonthlyCharges + Tenure + Gender,data = train_data)
# Predict probabilities from Q3
qda_pred_q3 <- predict(qda_q3, newdata = test_data)
# Extract probability for class "Left"
qda_probs_q3 <- qda_pred_q3$posterior[, "Left"]
# Compute ROC and AUC
library(pROC)
roc_q3 <- roc(response = test_data$Status,
predictor = qda_probs_q3,
levels = c("Current", "Left"),
direction = "<")
# Plot ROC curve
plot(roc_q3, main = "ROC Curve – QDA Model Q3", col = "purple", lwd = 2)
# Show AUC value
auc(roc_q3)
## Area under the curve: 0.8407
We will proceed with the model with the highest AUC, which here is Model Q3 (AUC = 0.8407).
Summary output for Model Q3
qda_q3
## Call:
## qda(Status ~ SeniorCitizen + Partner + Dependents + PhoneService +
## InternetService + Contract + PaperlessBilling + PaymentMethod +
## MonthlyCharges + Tenure + Gender, data = train_data)
##
## Prior probabilities of groups:
## Current Left
## 0.7333333 0.2666667
##
## Group means:
## SeniorCitizen1 PartnerYes DependentsYes PhoneServiceYes
## Current 0.1292696 0.5246978 0.3436679 0.9030478
## Left 0.2622832 0.3526012 0.1654624 0.9104046
## InternetServiceFiber optic InternetServiceNo ContractOne year
## Current 0.3512874 0.27588019 0.24855491
## Left 0.7052023 0.06213873 0.08381503
## ContractTwo year PaperlessBillingYes
## Current 0.31818182 0.5388860
## Left 0.02817919 0.7391618
## PaymentMethodCredit card (automatic) PaymentMethodElectronic check
## Current 0.2490804 0.2482922
## Left 0.1170520 0.5895954
## PaymentMethodMailed check MonthlyCharges Tenure GenderMale
## Current 0.2574882 61.26146 37.46926 0.5073568
## Left 0.1618497 74.55751 17.80347 0.4927746
# Get posterior probabilities for "Left"
qda_pred_q3 <- predict(qda_q3, newdata = test_data)
qda_probs_q3_left <- qda_pred_q3$posterior[, "Left"]
# Threshold = 0.4
pred_q3_th0.4 <- ifelse(qda_probs_q3_left > 0.4, "Left", "Current")
cat("Confusion Matrix – Threshold 0.4:\n")
## Confusion Matrix – Threshold 0.4:
print(table(pred_q3_th0.4, test_data$Status))
##
## pred_q3_th0.4 Current Left
## Current 716 83
## Left 235 264
cat("Prediction Accuracy (0.4):", round(mean(pred_q3_th0.4 == test_data$Status), 4), "\n\n")
## Prediction Accuracy (0.4): 0.755
# Threshold = 0.5
pred_q3_th0.5 <- ifelse(qda_probs_q3_left > 0.5, "Left", "Current")
cat("Confusion Matrix – Threshold 0.5:\n")
## Confusion Matrix – Threshold 0.5:
print(table(pred_q3_th0.5, test_data$Status))
##
## pred_q3_th0.5 Current Left
## Current 739 92
## Left 212 255
cat("Prediction Accuracy (0.5):", round(mean(pred_q3_th0.5 == test_data$Status), 4), "\n\n")
## Prediction Accuracy (0.5): 0.7658
# Threshold = 0.6
pred_q3_th0.6 <- ifelse(qda_probs_q3_left > 0.6, "Left", "Current")
cat("Confusion Matrix – Threshold 0.6:\n")
## Confusion Matrix – Threshold 0.6:
print(table(pred_q3_th0.6, test_data$Status))
##
## pred_q3_th0.6 Current Left
## Current 770 107
## Left 181 240
cat("Prediction Accuracy (0.6):", round(mean(pred_q3_th0.6 == test_data$Status), 4), "\n\n")
## Prediction Accuracy (0.6): 0.7781
# Create data frame
qda_q3_accuracy_df <- data.frame(
Threshold = c(0.4, 0.5, 0.6),
Accuracy = c(0.7550, 0.7658, 0.7781)
)
# View it
print(qda_q3_accuracy_df)
## Threshold Accuracy
## 1 0.4 0.7550
## 2 0.5 0.7658
## 3 0.6 0.7781
library(tree) # For building decision trees
# Build decision tree model
dt_model <- tree(Status ~ SeniorCitizen + Partner + Dependents + Tenure +
PhoneService + MultipleLines + InternetService + OnlineSecurity +
DeviceProtection + Contract + PaperlessBilling + PaymentMethod +
MonthlyCharges + TotalCharges + Gender,
data = train_data)
# Summary of the model
summary(dt_model)
##
## Classification tree:
## tree(formula = Status ~ SeniorCitizen + Partner + Dependents +
## Tenure + PhoneService + MultipleLines + InternetService +
## OnlineSecurity + DeviceProtection + Contract + PaperlessBilling +
## PaymentMethod + MonthlyCharges + TotalCharges + Gender, data = train_data)
## Variables actually used in tree construction:
## [1] "Contract" "InternetService" "Tenure"
## Number of terminal nodes: 6
## Residual mean deviance: 0.8761 = 4542 / 5184
## Misclassification error rate: 0.2094 = 1087 / 5190
# Plot the tree
plot(dt_model)
text(dt_model, pretty = 0)
# Predict class labels
dt_preds <- predict(dt_model, newdata = test_data, type = "class")
# Confusion matrix
confusion_matrix_dt <- table(dt_preds, test_data$Status)
print(confusion_matrix_dt)
##
## dt_preds Current Left
## Current 903 229
## Left 48 118
# Prediction accuracy
accuracy_dt <- mean(dt_preds == test_data$Status)
cat("Prediction Accuracy (Unpruned Tree):", accuracy_dt, "\n")
## Prediction Accuracy (Unpruned Tree): 0.7865948
# Step 1: Perform cross-validation to determine optimal size
cv.churn <- cv.tree(dt_model, FUN = prune.misclass)
# Step 2: Plot tree size vs misclassification error
plot(cv.churn$size, cv.churn$dev, type = "b",
xlab = "Tree Size (Number of Terminal Nodes)",
ylab = "Cross-Validation Error",
main = "CV Error vs. Tree Size")
# Step 3: Identify optimal tree size
best_size <- cv.churn$size[which.min(cv.churn$dev)]
cat("Best tree size:", best_size, "\n")
## Best tree size: 6
We used cross-validation to determine the optimal complexity of the decision tree model. The CV Error vs. Tree Size plot indicated that a tree with 6 terminal nodes minimizes cross-validation error.
# Prune the tree to the best size (6 terminal nodes)
pruned_churn_tree <- prune.tree(dt_model, best = 6)
# Plot the pruned tree
plot(pruned_churn_tree)
text(pruned_churn_tree, pretty = 0)
# Make predictions on test set
pruned_preds <- predict(pruned_churn_tree, newdata = test_data, type = "class")
# Create confusion matrix
conf_matrix_pruned <- table(Predicted = pruned_preds, Actual = test_data$Status)
print(conf_matrix_pruned)
## Actual
## Predicted Current Left
## Current 903 229
## Left 48 118
# Calculate prediction accuracy
accuracy_pruned <- mean(pruned_preds == test_data$Status)
cat("Prediction Accuracy (Pruned Tree - Size 6):", round(accuracy_pruned, 4), "\n")
## Prediction Accuracy (Pruned Tree - Size 6): 0.7866
This pruned model balances complexity and performance, avoiding overfitting while still achieving an accuracy of 78.66%.
Bagging and Random Forest will help us identify the most important predictors.
# Load required library
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
# Set seed for reproducibility
set.seed(123)
# ------------------------- #
# 1. Bagging Model (mtry = total predictors)
# ------------------------- #
bag_model <- randomForest(Status ~ .,
data = train_data,
mtry = ncol(train_data) - 1, # all predictors
importance = TRUE,
ntree = 500)
# Predict on test set
bag_preds <- predict(bag_model, newdata = test_data)
# Confusion Matrix and Accuracy
cat("Confusion Matrix – Bagging:\n")
## Confusion Matrix – Bagging:
print(table(Predicted = bag_preds, Actual = test_data$Status))
## Actual
## Predicted Current Left
## Current 844 186
## Left 107 161
bag_accuracy <- mean(bag_preds == test_data$Status)
cat("Prediction Accuracy (Bagging):", round(bag_accuracy, 4), "\n\n")
## Prediction Accuracy (Bagging): 0.7743
# Variable Importance Plot
varImpPlot(bag_model, main = "Variable Importance – Bagging")
# ------------------------- #
# 2. Random Forest Model (mtry = √p)
# ------------------------- #
rf_model <- randomForest(Status ~ .,
data = train_data,
mtry = floor(sqrt(ncol(train_data) - 1)),
importance = TRUE,
ntree = 500)
# Predict on test data
rf_preds <- predict(rf_model, newdata = test_data)
# Confusion Matrix and Accuracy
cat("Confusion Matrix – Random Forest:\n")
## Confusion Matrix – Random Forest:
print(table(Predicted = rf_preds, Actual = test_data$Status))
## Actual
## Predicted Current Left
## Current 858 173
## Left 93 174
rf_accuracy <- mean(rf_preds == test_data$Status)
cat("Prediction Accuracy (Random Forest):", round(rf_accuracy, 4), "\n\n")
## Prediction Accuracy (Random Forest): 0.7951
# Variable Importance Plot
varImpPlot(rf_model, main = "Variable Importance – Random Forest")
Bagging – Variable Importance
Random Forest – Variable Importance
Let's try to get the best accuracy using only the important predictors.
# Load necessary library
library(tree)
# --------------------------
# Step 1: Build Model D1
# --------------------------
model_d1 <- tree(Status ~ TotalCharges + Contract + Tenure + MonthlyCharges,
data = train_data)
# Summary
summary(model_d1)
##
## Classification tree:
## tree(formula = Status ~ TotalCharges + Contract + Tenure + MonthlyCharges,
## data = train_data)
## Variables actually used in tree construction:
## [1] "Contract" "MonthlyCharges" "Tenure"
## Number of terminal nodes: 6
## Residual mean deviance: 0.886 = 4593 / 5184
## Misclassification error rate: 0.2139 = 1110 / 5190
# Plot the tree
plot(model_d1)
text(model_d1, pretty = 0)
# --------------------------
# Step 2: Predict & Accuracy
# --------------------------
pred_d1 <- predict(model_d1, newdata = test_data, type = "class")
# Confusion Matrix
conf_matrix_d1 <- table(Predicted = pred_d1, Actual = test_data$Status)
print(conf_matrix_d1)
## Actual
## Predicted Current Left
## Current 902 234
## Left 49 113
# Accuracy
accuracy_d1 <- mean(pred_d1 == test_data$Status)
cat("Prediction Accuracy (Unpruned Tree – Model D1):", round(accuracy_d1, 4), "\n")
## Prediction Accuracy (Unpruned Tree – Model D1): 0.782
# --------------------------
# Step 3: Cross-Validation
# --------------------------
set.seed(123)
cv_d1 <- cv.tree(model_d1, FUN = prune.misclass)
# Plot CV Error vs Tree Size
plot(cv_d1$size, cv_d1$dev, type = "b",
xlab = "Tree Size (Terminal Nodes)",
ylab = "Cross-Validation Error",
main = "CV Error vs Tree Size – Model D1")
# Get best size
best_size_d1 <- cv_d1$size[which.min(cv_d1$dev)]
cat("Best tree size for Model D1:", best_size_d1, "\n")
## Best tree size for Model D1: 6
# --------------------------
# Step 4: Prune the tree
# --------------------------
pruned_d1 <- prune.tree(model_d1, best = best_size_d1)
# Plot pruned tree
plot(pruned_d1)
text(pruned_d1, pretty = 0)
# Predict using pruned tree
pruned_d1_preds <- predict(pruned_d1, newdata = test_data, type = "class")
# Confusion Matrix
conf_matrix_pruned_d1 <- table(Predicted = pruned_d1_preds, Actual = test_data$Status)
print(conf_matrix_pruned_d1)
## Actual
## Predicted Current Left
## Current 902 234
## Left 49 113
# Accuracy
accuracy_pruned_d1 <- mean(pruned_d1_preds == test_data$Status)
cat("Prediction Accuracy (Pruned Tree – Model D1):", round(accuracy_pruned_d1, 4), "\n")
## Prediction Accuracy (Pruned Tree – Model D1): 0.782
# Load necessary library
library(tree)
# --------------------------
# Step 1: Build Model D2
# --------------------------
model_d2 <- tree(Status ~ TotalCharges + Contract + Tenure + MonthlyCharges +
InternetService + OnlineSecurity,
data = train_data)
# Summary
summary(model_d2)
##
## Classification tree:
## tree(formula = Status ~ TotalCharges + Contract + Tenure + MonthlyCharges +
## InternetService + OnlineSecurity, data = train_data)
## Variables actually used in tree construction:
## [1] "Contract" "InternetService" "Tenure"
## Number of terminal nodes: 6
## Residual mean deviance: 0.8761 = 4542 / 5184
## Misclassification error rate: 0.2094 = 1087 / 5190
# Plot the tree
plot(model_d2)
text(model_d2, pretty = 0)
# --------------------------
# Step 2: Predict & Accuracy
# --------------------------
pred_d2 <- predict(model_d2, newdata = test_data, type = "class")
# Confusion Matrix
conf_matrix_d2 <- table(Predicted = pred_d2, Actual = test_data$Status)
print(conf_matrix_d2)
## Actual
## Predicted Current Left
## Current 903 229
## Left 48 118
# Accuracy
acc_d2 <- mean(pred_d2 == test_data$Status)
cat("Prediction Accuracy (Unpruned Tree – Model D2):", round(acc_d2, 4), "\n")
## Prediction Accuracy (Unpruned Tree – Model D2): 0.7866
# --------------------------
# Step 3: Cross-Validation
# --------------------------
set.seed(123)
cv_d2 <- cv.tree(model_d2, FUN = prune.misclass)
# Plot CV Error vs Tree Size
plot(cv_d2$size, cv_d2$dev, type = "b",
xlab = "Tree Size (Terminal Nodes)",
ylab = "Cross-Validation Error",
main = "CV Error vs Tree Size – Model D2")
# Get best size
best_size_d2 <- cv_d2$size[which.min(cv_d2$dev)]
cat("Best tree size for Model D2:", best_size_d2, "\n")
## Best tree size for Model D2: 6
# --------------------------
# Step 4: Prune the Tree
# --------------------------
pruned_d2 <- prune.tree(model_d2, best = best_size_d2)
# Plot pruned tree
plot(pruned_d2)
text(pruned_d2, pretty = 0)
# Predict using pruned tree
pruned_d2_preds <- predict(pruned_d2, newdata = test_data, type = "class")
# Confusion Matrix
conf_matrix_pruned_d2 <- table(Predicted = pruned_d2_preds, Actual = test_data$Status)
print(conf_matrix_pruned_d2)
## Actual
## Predicted Current Left
## Current 903 229
## Left 48 118
# Accuracy
accuracy_pruned_d2 <- mean(pruned_d2_preds == test_data$Status)
cat("Prediction Accuracy (Pruned Tree – Model D2):", round(accuracy_pruned_d2, 4), "\n")
## Prediction Accuracy (Pruned Tree – Model D2): 0.7866
# Create a dataframe with prediction accuracies for different Decision Tree models
dt_results <- data.frame(
Model = c("Full Model (Unpruned)",
"Full Model (Pruned)",
"Model D1 (Unpruned)",
"Model D1 (Pruned)",
"Model D2 (Unpruned)",
"Model D2 (Pruned)"),
Accuracy = c(0.7866, 0.7866, 0.782, 0.782, 0.7866, 0.7866)
)
# View the results
print(dt_results)
## Model Accuracy
## 1 Full Model (Unpruned) 0.7866
## 2 Full Model (Pruned) 0.7866
## 3 Model D1 (Unpruned) 0.7820
## 4 Model D1 (Pruned) 0.7820
## 5 Model D2 (Unpruned) 0.7866
## 6 Model D2 (Pruned) 0.7866
Compare across the methods used above (excluding the model built with decision trees) and report your best method based on ROC plots.
# Create AUC comparison dataframe
auc_comparison <- data.frame(
Method = c("Logistic Regression (Model 2)",
"Naive Bayes (Model N2)",
"LDA (Model R4)",
"QDA (Model Q3)"),
AUC = c(0.8465, 0.8316, 0.8405, 0.8407)
)
# View the dataframe
print(auc_comparison)
## Method AUC
## 1 Logistic Regression (Model 2) 0.8465
## 2 Naive Bayes (Model N2) 0.8316
## 3 LDA (Model R4) 0.8405
## 4 QDA (Model Q3) 0.8407
Which model does the best in terms of the prediction accuracy on the test set? Include the decision tree model here.
# Create prediction accuracy dataframe
prediction_accuracy <- data.frame(
Method = c("Logistic Regression (Model 2)",
"Naive Bayes (Model N2)",
"LDA (Model R4)",
"QDA (Model Q3)",
"Decision Tree (Model D2: pruned)"),
Pred_accuracy = c(0.8028, 0.7904, 0.7989, 0.7781, 0.7866)
)
# View the dataframe
print(prediction_accuracy)
## Method Pred_accuracy
## 1 Logistic Regression (Model 2) 0.8028
## 2 Naive Bayes (Model N2) 0.7904
## 3 LDA (Model R4) 0.7989
## 4 QDA (Model Q3) 0.7781
## 5 Decision Tree (Model D2: pruned) 0.7866
In this section, we will use the Implementation_Data file to generate business recommendations. This data file has information on 500 current customers. Use your logistic regression model with your best threshold (identified in part 2) to answer the following questions:
# Load the dataset
load("Implementation_Data.rda")
str(Implementation_Data)
## 'data.frame': 500 obs. of 16 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 1 2 2 2 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 1 1 1 0 0 1 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 1 2 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 2 1 1 1 ...
## $ Tenure : int 45 64 26 23 26 2 67 71 2 72 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 1 3 1 1 1 3 1 3 1 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 3 2 1 3 3 3 1 2 1 3 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 2 3 1 2 2 2 3 3 1 2 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 2 1 3 2 2 2 1 1 1 2 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 2 3 3 2 2 2 1 1 1 2 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 3 3 2 2 2 1 3 2 1 3 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 2 2 2 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 1 4 3 2 4 1 2 2 2 ...
## $ MonthlyCharges : num 19.2 110.3 60.7 20.8 19.3 ...
## $ TotalCharges : num 904 6997 1597 485 504 ...
head(Implementation_Data)
# Convert SeniorCitizen column to a factor in the implementation dataset
# This ensures consistency with the model which was trained using SeniorCitizen as a factor
Implementation_Data <- Implementation_Data %>%
mutate(SeniorCitizen = as.factor(SeniorCitizen))
# Get summary of the logistic regression model (Model 2)
summary(log_model_reduced)
##
## Call:
## glm(formula = Status ~ SeniorCitizen + Tenure + PhoneService +
## MultipleLines + InternetService + OnlineSecurity + Contract +
## PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges,
## family = "binomial", data = train_data)
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.245e-01 2.644e-01 0.471 0.637643
## SeniorCitizen1 2.907e-01 9.522e-02 3.053 0.002265 **
## Tenure -6.932e-02 7.392e-03 -9.378 < 2e-16 ***
## PhoneServiceYes -8.410e-01 1.744e-01 -4.822 1.42e-06 ***
## MultipleLinesNo phone service NA NA NA NA
## MultipleLinesYes 2.521e-01 9.512e-02 2.650 0.008046 **
## InternetServiceFiber optic 8.072e-01 1.615e-01 4.997 5.81e-07 ***
## InternetServiceNo -5.271e-01 2.205e-01 -2.391 0.016811 *
## OnlineSecurityNo internet service NA NA NA NA
## OnlineSecurityYes -5.711e-01 1.007e-01 -5.669 1.44e-08 ***
## ContractOne year -6.808e-01 1.273e-01 -5.349 8.87e-08 ***
## ContractTwo year -1.349e+00 1.979e-01 -6.816 9.37e-12 ***
## PaperlessBillingYes 2.849e-01 8.602e-02 3.312 0.000926 ***
## PaymentMethodCredit card (automatic) -1.305e-01 1.345e-01 -0.971 0.331685
## PaymentMethodElectronic check 4.293e-01 1.104e-01 3.890 0.000100 ***
## PaymentMethodMailed check -7.842e-02 1.347e-01 -0.582 0.560524
## MonthlyCharges 1.968e-03 4.904e-03 0.401 0.688137
## TotalCharges 4.154e-04 8.306e-05 5.001 5.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6019.5 on 5189 degrees of freedom
## Residual deviance: 4300.5 on 5174 degrees of freedom
## AIC: 4332.5
##
## Number of Fisher Scoring iterations: 6
# Step 1: Make probability predictions using your best logistic model (Model 2)
impl_probs <- predict(log_model_reduced, newdata = Implementation_Data, type = "response")
# Step 2: Apply the best threshold (0.5) to classify
impl_predictions <- ifelse(impl_probs > 0.5, "Left", "Current")
# Step 3: Count how many customers are predicted to leave
table(impl_predictions)
## impl_predictions
## Current Left
## 389 111
num_at_risk <- sum(impl_predictions == "Left")
cat("Number of customers predicted to leave:", num_at_risk, "\n")
## Number of customers predicted to leave: 111
Using our best-performing logistic regression model (Model 2) and a probability threshold of 0.5, we find that 111 of the 500 current customers in the implementation dataset are predicted to be at risk of leaving the company. This represents 22.2% of that customer base.
# Step 1: Predict probabilities using Logistic Regression Model 2
log_probs_impl <- predict(log_model_reduced, newdata = Implementation_Data, type = "response")
# Step 2: Assign predicted classes using threshold = 0.5
predicted_classes <- ifelse(log_probs_impl > 0.5, "Left", "Current")
# Step 3: Filter customers predicted to leave
customers_to_leave <- Implementation_Data[predicted_classes == "Left", ]
# Step 4: Calculate total predicted monthly revenue loss
total_revenue_loss <- sum(customers_to_leave$MonthlyCharges)
# Step 5: Print result
cat("Total predicted monthly revenue loss if all predicted-to-leave customers leave: $", round(total_revenue_loss, 2), "\n")
## Total predicted monthly revenue loss if all predicted-to-leave customers leave: $ 8883.75
To answer this, we look at the significant predictors from our logistic regression model (Model 2), especially those that are both statistically significant and actionable through retention offers:
Based on the logistic regression results and business insight, the following predictors should be prioritized for customer retention strategies:
Customers on Month-to-month contracts are far more likely to leave. Action: Offer discounts or perks to encourage switching to 1-year or 2-year contracts.
Customers without online security services tend to leave more frequently. Action: Provide free trials or bundle online security features as add-ons to retain these customers.
Customers on Fiber Optic internet are at higher risk of leaving; note that having no internet service is actually associated with lower churn (negative coefficient on InternetServiceNo). Action: Review pricing and service quality on fiber-optic plans, and promote stable, reliable alternatives such as DSL or upgraded service packages.
Customers paying via Electronic Checks are more likely to leave. Action: Encourage auto-pay or card payment methods by offering small incentives or bill credits.
Customers enrolled in paperless billing are more likely to leave (positive coefficient on PaperlessBillingYes). Action: Pair paperless billing with proactive engagement, such as usage summaries or loyalty credits, rather than treating it as a retention lever on its own.
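The strongest actionable signals above (month-to-month contract, electronic-check payment, no online security) can be combined into a simple targeting rule. A minimal sketch on toy data, assuming the same column names and factor levels as the model formula; in the notebook you would apply the same `filter()` to `Implementation_Data`:

```r
library(dplyr)

# Toy data standing in for Implementation_Data; column names and levels
# mirror those used in the model above.
customers <- tibble(
  CustomerID     = 1:4,
  Contract       = c("Month-to-month", "Two year", "Month-to-month", "One year"),
  PaymentMethod  = c("Electronic check", "Mailed check",
                     "Electronic check", "Credit card (automatic)"),
  OnlineSecurity = c("No", "Yes", "No", "Yes")
)

# Flag customers combining all three high-risk signals
high_risk <- customers %>%
  filter(Contract == "Month-to-month",
         PaymentMethod == "Electronic check",
         OnlineSecurity == "No")

nrow(high_risk)  # customers to prioritize for contract/security offers -> 2
```

A rule like this is a useful complement to the model scores: it tells front-line teams which offer to lead with for each flagged customer.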
Incentive Scheme Proposal to Reduce Customer Loss - Based on our logistic regression model (Model 2), we identified 111 current customers who are at risk of leaving the company. These customers contribute approximately $8,883.75 in monthly revenue. If no action is taken, this revenue will be lost.
Proposed Incentive
Offer a $20 incentive (e.g., bill credit, one-time discount, or premium add-on) to each at-risk customer to encourage retention. Options could include:
$20 off the next bill
Free one-month premium service or support
Loyalty rewards for contract renewal
Expected Outcome
Assuming a 50% retention rate, we expect to keep roughly 56 of the 111 at-risk customers, preserving about half of the monthly revenue currently at risk.
# Step 1: Load the implementation dataset
load("Implementation_Data.rda")
# Step 2: Ensure variable types match training data
Implementation_Data <- Implementation_Data %>%
  mutate(
    SeniorCitizen  = as.factor(SeniorCitizen),
    Tenure         = as.numeric(Tenure),
    MonthlyCharges = as.numeric(MonthlyCharges),
    TotalCharges   = as.numeric(TotalCharges)
  )
# Step 3: Predict probabilities using the final logistic regression model
impl_probs <- predict(log_model_reduced, newdata = Implementation_Data, type = "response")
# Step 4: Assign class labels based on best threshold (e.g., 0.5)
impl_predicted_class <- ifelse(impl_probs >= 0.5, "Left", "Current")
# Step 5: Create a dataframe with predictions
Implementation_Data$Predicted_Status <- impl_predicted_class
Implementation_Data$Prob_Left <- impl_probs
# Step 6: Filter customers predicted to leave
predicted_leavers <- Implementation_Data %>%
  filter(Predicted_Status == "Left")
# Step 7: Cost-benefit calculation
incentive_per_customer <- 20 # Proposed incentive in USD
estimated_retention_rate <- 0.5 # Assume 50% retention with incentives
# Total predicted monthly revenue from customers at risk
total_monthly_revenue <- sum(predicted_leavers$MonthlyCharges)
# Estimated retained revenue (50% of at-risk customers)
estimated_retained_revenue <- total_monthly_revenue * estimated_retention_rate
# Incentive cost = $20 * number of at-risk customers
total_incentive_cost <- nrow(predicted_leavers) * incentive_per_customer
# Net benefit
net_benefit <- estimated_retained_revenue - total_incentive_cost
# Output
cat("Number of At-Risk Customers:", nrow(predicted_leavers), "\n")
## Number of At-Risk Customers: 111
cat("Total Monthly Revenue at Risk: $", round(total_monthly_revenue, 2), "\n")
## Total Monthly Revenue at Risk: $ 8883.75
cat("Estimated Retained Revenue (50%): $", round(estimated_retained_revenue, 2), "\n")
## Estimated Retained Revenue (50%): $ 4441.88
cat("Total Incentive Cost: $", total_incentive_cost, "\n")
## Total Incentive Cost: $ 2220
cat("Net Benefit: $", round(net_benefit, 2), "\n")
## Net Benefit: $ 2221.88
Recommendation to Management - To mitigate the risk of losing approximately $8,883.75 in monthly revenue, we recommend implementing a retention incentive of $20 per customer for the 111 at-risk customers identified by our model.
By doing so:
We expect to retain at least 50% of these customers,
Resulting in a retained revenue of $4,441.88,
At a cost of $2,220.00 in incentives,
Yielding a net monthly benefit of $2,221.88.
This plan is cost-effective, data-driven, and aligned with our retention goals. It balances short-term costs with long-term revenue preservation and should be approved for immediate implementation.
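Because the net benefit hinges on the assumed 50% retention rate, it is worth checking how sensitive the conclusion is to that assumption. A minimal sketch, reusing the revenue and cost figures computed above:

```r
# Sensitivity check: net monthly benefit of the $20 incentive at different
# assumed retention rates, using the figures computed above.
revenue_at_risk <- 8883.75   # total monthly revenue of the 111 at-risk customers
incentive_cost  <- 111 * 20  # $20 offered to every targeted customer

retention_rates <- c(0.25, 0.50, 0.75)
net_benefit <- revenue_at_risk * retention_rates - incentive_cost
data.frame(retention_rates, net_benefit)
```

Even at a pessimistic 25% retention rate the scheme roughly breaks even (net benefit of about $0.94), so the recommendation is robust to the 50% assumption.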